Bay Area

Result:

Central Napa and West Oakland have the highest PM2.5 concentrations. The eastern and central parts of the Bay Area are the next most polluted regions. The western North Bay, such as Santa Rosa, has lower PM2.5. The southern Bay Area, such as Gilroy, is also less polluted.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.500   8.262   8.564   8.480   8.748  10.522

PM2.5 stacked plot

# PM2.5 stacked plot (fill)

Result:

Compared with the proportions in “Total”, White residents make up a higher proportion in the 10-11 PM2.5 tier and especially in the 5-6, 6-7, and 7-8 tiers. In contrast, Asian residents make up a lower proportion in these tiers. Black or African American residents make up a higher proportion in the 9-10 and 10-11 PM2.5 tiers.
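The comparison above rests on within-tier shares, which is what a stacked plot with a "fill" position displays. A minimal sketch of that calculation in base R follows; the data frame `tracts`, its column names, and all counts are hypothetical and are not taken from the report's data.

```r
# Hypothetical tract-level data: PM2.5 and population counts by race (made-up numbers).
tracts <- data.frame(
  pm25  = c(5.6, 6.4, 7.8, 8.3, 8.9, 9.5, 10.2),
  White = c(900, 800, 700, 500, 450, 300, 600),
  Asian = c(100, 150, 200, 400, 450, 300, 150),
  Black = c( 50,  50, 100, 100, 100, 400, 250)
)

# Bin PM2.5 into one-unit tiers: [5,6), [6,7), ..., [10,11).
tracts$tier <- cut(tracts$pm25, breaks = 5:11, right = FALSE,
                   labels = paste(5:10, 6:11, sep = "-"))

# Sum population by tier (rows = tiers, columns = race groups).
races  <- c("White", "Asian", "Black")
counts <- t(sapply(split(tracts[, races], tracts$tier), colSums))

# Convert each tier's counts to shares that sum to 1.
shares <- prop.table(counts, margin = 1)

# The "Total" row is the reference composition across all tiers.
total <- colSums(counts) / sum(counts)
print(round(rbind(shares, Total = total), 3))
```

Each tier's row can then be read against the `Total` row, which is the comparison the result paragraph makes.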

Asthma Plot

Results:

The east side of the Bay Area, particularly between Oakland and the East Bay, has high Asthma prevalence.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.93   25.84   39.98   52.15   64.33  243.29       1

#Combine PM2.5, Asthma with race tract data

Asthma prevalence by race stacked plot

# Asthma prevalence by race stacked plot (fill)

Result: The regions with higher Asthma prevalence (the 100-150, 150-200, and 200-250 levels) have lower proportions of White and Asian residents, but higher proportions of Black or African American residents and of some other race groups.

Problem 2:

Best Fit

Scatter plot of PM2.5 vs. Asthma

Scatter plot of PM2.5 vs. Asthma with lm smooth line

## [1] 52.14577
## [1] 2453255

Result:

The scatter plot does not show a good fit: many points lie far above the best-fit line.

Problem 3:

Regression Analysis – Optimization of SSR(Residuals)

## [1] 2453255
## $par
## [1]   19.85653 -116.23251
## 
## $value
## [1] 2217584
## 
## $counts
## function gradient 
##      117       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

## [1] 0.0006923995

Regression model with optimization

## 
## Call:
## lm(formula = Asthma ~ PM2.5, data = ces4_bay_pm25_Asthma)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.47 -25.89  -9.61  12.94 182.95 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -116.278     13.040  -8.917   <2e-16 ***
## PM2.5         19.862      1.534  12.950   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.49 on 1578 degrees of freedom
## Multiple R-squared:  0.09606,    Adjusted R-squared:  0.09549 
## F-statistic: 167.7 on 1 and 1578 DF,  p-value: < 2.2e-16

Result:

The linear regression analysis uses an optimization approach that minimizes the sum of squared residuals (SSR), which gives the best fit under the assumption of a linear model. The fitted regression equation is:

Asthma prevalence = -116.278 + 19.862 * PM2.5

An increase of 1 µg/m³ in the annual mean concentration of PM2.5 is associated with an increase of 19.862 in the age-adjusted rate of ED visits for asthma per 10,000. About 9.606% of the variation in the age-adjusted rate of ED visits for asthma per 10,000 is explained by the variation in the annual mean concentration of PM2.5.
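The approach described in this result can be sketched as follows. The code is not shown in the report, so this is an illustration under assumptions: the data are synthetic stand-ins, the function name `ssr` is hypothetical, and the parameter order (slope first, then intercept) is chosen only to match the order of `$par` in the output above.

```r
set.seed(1)
# Synthetic stand-in data (not the CalEnviroScreen values).
x <- runif(200, 5.5, 10.5)
y <- -116 + 20 * x + rnorm(200, sd = 35)

# Sum of squared residuals for a line with par = c(slope, intercept).
ssr <- function(par, x, y) sum((y - (par[1] * x + par[2]))^2)

# Nelder-Mead search (optim's default), started from a flat line at the mean.
fit_optim <- optim(par = c(0, mean(y)), fn = ssr, x = x, y = y)

# Closed-form least squares for comparison.
fit_lm <- lm(y ~ x)

# R-squared from the two sums of squares: 1 - SSR / SST.
sst <- sum((y - mean(y))^2)
r2  <- 1 - sum(resid(fit_lm)^2) / sst

print(fit_optim$par)       # slope, intercept found by optim
print(rev(coef(fit_lm)))   # slope, intercept from lm
print(r2)
```

Plugging in the two values printed in this report gives 1 - 2217584/2453255 ≈ 0.0961, which agrees with the Multiple R-squared of 0.09606. This suggests, although the output does not label it, that 2453255 is the total sum of squares around the mean of 52.14577.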

##       1 
## 42.6178
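The printed prediction is consistent with evaluating the fitted line at a PM2.5 value of 8 µg/m³. The input is not shown in the output, so that value is an inference from the coefficients rather than something the report states:

$$
\widehat{\text{Asthma}} = -116.278 + 19.862 \times 8 = 42.618
$$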

Problem 4:

#Residual density Plot before log transformation

Result:

For the regression line to be a good fit, the residuals from the fitted line should follow a normal distribution centered on 0. However, the residual density plot shows that the residual distribution is clearly right-skewed rather than normal.
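The residual summary printed earlier already points in this direction: the median residual (-9.61) is below the mean of 0, and the largest residual (182.95) is far larger in magnitude than the smallest (-54.47). A minimal sketch of how such skew can be checked numerically is shown below; the data are synthetic and the helper `skewness` is a hypothetical name, not a function from the report.

```r
set.seed(2)
# Synthetic right-skewed response (hypothetical: log-normal noise around a trend).
x <- runif(300, 5.5, 10.5)
y <- exp(0.7 + 0.35 * x + rnorm(300, sd = 0.65))

res <- resid(lm(y ~ x))

# Sample skewness: third standardized moment, about 0 for a symmetric distribution.
skewness <- function(r) mean((r - mean(r))^3) / mean((r - mean(r))^2)^1.5

print(skewness(res))            # positive values indicate a right skew
print(mean(res) - median(res))  # mean above median is another sign of right skew

# Kernel density estimate: the curve that a residual density plot draws.
d <- density(res)
print(d$x[which.max(d$y)])      # location of the density peak
```

Calling `plot(d)` would draw the density curve itself.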

Best Fit after log transformation of Asthma prevalence

Scatter plot of PM2.5 vs. log Asthma

Problem 5:

Regression Analysis after log transformation of Asthma prevalence

Calculation of SSR

## [1] 2453255

Optimization of SSR(Residuals) after log transformation of Asthma prevalence

## $par
## [1] 0.3566878 0.6887687
## 
## $value
## [1] 680.3463
## 
## $counts
## function gradient 
##       65       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Regression line after log transformation of Asthma prevalence

Calculation of residuals for regression line after log transformation

## [1] 0.0005442665
## 
## Call:
## lm(formula = LN_Asthma ~ PM2.5, data = ces4_bay_pm25_LN_Asthma)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00402 -0.46479  0.03313  0.42298  1.75525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.69234    0.22840   3.031  0.00248 ** 
## PM2.5        0.35633    0.02686  13.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6566 on 1578 degrees of freedom
## Multiple R-squared:  0.1003, Adjusted R-squared:  0.09974 
## F-statistic: 175.9 on 1 and 1578 DF,  p-value: < 2.2e-16

Residual scatter plot of PM2.5 vs. residuals

Result:

Based on the residual scatter plot from the regression model after log transformation, the residuals are spread much more evenly above and below the zero line, which implies that under- and over-estimation occur about equally often. Therefore, the log transformation is essential, and the regression line after log transformation is a better fit.
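The balance of residuals around zero can also be summarized with two simple numbers. The sketch below uses synthetic data, and the helpers `share_above` and `tail_ratio` are hypothetical names introduced for illustration only.

```r
set.seed(3)
# Synthetic data with multiplicative (log-normal) noise; not the report's data.
x <- runif(300, 5.5, 10.5)
y <- exp(0.7 + 0.35 * x + rnorm(300, sd = 0.65))

res_raw <- resid(lm(y ~ x))        # residuals on the original scale
res_log <- resid(lm(log(y) ~ x))   # residuals after log-transforming the response

# Share of residuals above zero; 0.5 means under- and over-estimation are balanced.
share_above <- function(r) mean(r > 0)

# Largest positive residual relative to the largest negative one; near 1 when symmetric.
tail_ratio <- function(r) max(r) / abs(min(r))

print(c(raw = share_above(res_raw), log = share_above(res_log)))
print(c(raw = tail_ratio(res_raw),  log = tail_ratio(res_log)))
```

Applied to the two residual summaries printed in this report, the tail ratio is about 182.95/54.47 ≈ 3.4 before the transformation and about 1.755/2.004 ≈ 0.88 after it.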

Residual density plot after log transformation of Asthma prevalence

Result:

A positive residual means under-estimation and a negative residual means over-estimation. Hence, the most negative residual marks the largest over-estimation. Comparing the residual density plot after log transformation with the one before, the residual distribution has changed from right-skewed to much more symmetric. The residual density is now closer to a normal distribution with mean 0. This indicates a better fit for the regression model after log transformation and supports the need to log-transform Asthma prevalence.
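Because the response is now on a log scale, the slope reads as a multiplicative effect. Assuming `LN_Asthma` is the natural logarithm of Asthma prevalence (the printed row with Asthma 4.93 and LN_Asthma 1.60 is consistent with this), the fitted coefficients from the summary above give:

$$
\ln(\text{Asthma}) = 0.69234 + 0.35633 \cdot \text{PM2.5}
\quad\Longrightarrow\quad
\frac{\widehat{\text{Asthma}}(x+1)}{\widehat{\text{Asthma}}(x)} = e^{0.35633} \approx 1.428
$$

Under this model, each additional 1 µg/m³ of PM2.5 is associated with a predicted Asthma rate roughly 43% higher.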

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.00402 -0.46479  0.03313  0.00000  0.42298  1.75525
## Simple feature collection with 1 feature and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.1737 ymin: 37.41911 xmax: -122.1492 ymax: 37.44193
## Geodetic CRS:  NAD83
## # A tibble: 1 × 6
##   `Census Tract` PM2.5                       geometry Asthma LN_Asthma residuals
##            <dbl> <dbl>             <MULTIPOLYGON [°]>  <dbl>     <dbl>     <dbl>
## 1     6085513000  8.16 (((-122.1737 37.42636, -122.1…   4.93      1.60     -2.00

Result:

The lowest residual in the regression estimation is about -2.00 (the minimum of -2.00402 in the summary above). It occurs in Census Tract 6085513000, where the regression model over-estimates the Asthma prevalence.
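As a check, the residual for this tract can be recomputed by hand from the printed row and the coefficients of the log-transformed model, assuming the residual is defined as observed minus fitted:

$$
\begin{aligned}
\ln(4.93) &\approx 1.595 \\
\widehat{\ln(\text{Asthma})} &= 0.69234 + 0.35633 \times 8.16 \approx 3.600 \\
\text{residual} &\approx 1.595 - 3.600 = -2.005
\end{aligned}
$$

PM2.5 is printed rounded to 8.16, so the last digits differ slightly from the minimum of -2.00402 in the residual summary. The fitted value of about 3.600 on the log scale corresponds to a predicted rate of roughly 36.6, far above the observed 4.93.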